M1L3 Homework Assignment

Exploratory Data Analysis assignment

Answer

library("ggplot2")
library("reshape2")
library("e1071")                   

file_L3<- "/Users/fanxueyi/Documents/NEU Bioinformatics/DSCS6030 Intro Data Mining:Machine Learing/Module1_Getting_to_Know_a_Data_Set/Assignment/M01_quasi_twitter.csv"

L3Q1_data <- read.csv(file_L3)
str(L3Q1_data)
## 'data.frame':    21916 obs. of  25 variables:
##  $ screen_name            : Factor w/ 21916 levels "+5400E1.","000D0se7",..: 4341 15303 21127 13570 14085 3607 14942 8653 15547 19146 ...
##  $ created_at_month       : int  2 11 4 3 4 2 7 5 1 1 ...
##  $ created_at_day         : int  9 21 1 24 23 9 15 23 23 13 ...
##  $ created_at_year        : int  2007 2009 2007 2007 2009 2009 2006 2008 2009 2009 ...
##  $ country                : Factor w/ 44 levels " Germany","Argentina",..: 44 19 19 44 44 12 44 5 44 44 ...
##  $ location               : Factor w/ 378 levels "Akron Ohio","Alabama",..: 188 202 25 233 211 79 365 41 242 83 ...
##  $ friends_count          : int  1087 5210 1015 338 641 917 1574 16300 8316 640 ...
##  $ followers_count        : int  22187643 6692814 6257020 3433218 2929559 2540842 1960373 1934803 1855827 1697620 ...
##  $ statuses_count         : int  60246 93910 118465 78082 93892 59397 41023 62178 56057 82912 ...
##  $ favourites_count       : int  1122 3825 1143 0 226 2122 20160 15 540 3 ...
##  $ favourited_count       : int  105005 40487 87968 25943 32589 19760 13558 25084 8732 24515 ...
##  $ dob_day                : int  29 24 4 22 9 1 2 6 15 26 ...
##  $ dob_year               : int  1999 1991 1997 1998 1963 1995 1999 1986 1991 1986 ...
##  $ dob_month              : int  4 10 3 8 11 1 11 10 2 9 ...
##  $ gender                 : Factor w/ 2 levels "female","male": 1 1 2 2 1 1 1 2 1 2 ...
##  $ mobile_favourites_count: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mobile_favourited_count: int  0 5032191 0 0 0 0 0 1934803 0 0 ...
##  $ education              : int  8 15 9 9 13 15 14 10 11 12 ...
##  $ experience             : int  0 0 0 44 24 21 31 0 27 20 ...
##  $ age                    : int  29 0 32 40 45 14 27 31 34 40 ...
##  $ race                   : Factor w/ 10 levels "arab","asian",..: 10 10 10 10 10 10 10 10 2 1 ...
##  $ wage                   : num  16.3 17.9 15.7 7 17.9 ...
##  $ retweeted_count        : int  1 1 2 0 1 2 1 2 0 0 ...
##  $ retweet_count          : int  30 6 65 8 7 64 13 14 15 10 ...
##  $ height                 : int  156 162 168 180 162 158 160 178 156 173 ...
names(L3Q1_data)
##  [1] "screen_name"             "created_at_month"       
##  [3] "created_at_day"          "created_at_year"        
##  [5] "country"                 "location"               
##  [7] "friends_count"           "followers_count"        
##  [9] "statuses_count"          "favourites_count"       
## [11] "favourited_count"        "dob_day"                
## [13] "dob_year"                "dob_month"              
## [15] "gender"                  "mobile_favourites_count"
## [17] "mobile_favourited_count" "education"              
## [19] "experience"              "age"                    
## [21] "race"                    "wage"                   
## [23] "retweeted_count"         "retweet_count"          
## [25] "height"

Column1 screen_name

head(L3Q1_data$screen_name)
## [1] CNN      osbrFe   WSJ      ninc     nssubies BNCC    
## 21916 Levels: +5400E1. 000D0se7 001apdov 001RBTePh 003B0K2 ... zzzrnfoia

This column contains a distinct screen name for every user (21916 levels for 21916 observations), so it is an identifier and has no meaningful distribution.

Column2,3,4 created_at_month, day, year

  • created_at_month
#distribution
qplot(created_at_month, data = L3Q1_data,geom = "bar") 

#summary statistics
summary(L3Q1_data$created_at_month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   6.000   6.069   9.000  12.000

created_at_month appears to be roughly uniformly distributed, although March and April have more account creations than the other months.
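
The uniformity claim can be checked more formally than by eye. Below is a minimal sketch, assuming a chi-squared goodness-of-fit test is an acceptable check here. The `months` vector is simulated stand-in data, not the real column; with the real data one would pass `L3Q1_data$created_at_month` instead.

```r
# Simulated stand-in for L3Q1_data$created_at_month (hypothetical data).
set.seed(42)
months <- sample(1:12, size = 5000, replace = TRUE)

# Count accounts per month; factor() keeps months that have zero counts.
month_counts <- table(factor(months, levels = 1:12))

# On a one-way table, chisq.test() tests against equal probabilities by default.
uniform_test <- chisq.test(month_counts)
uniform_test$p.value
```

A small p-value would argue against a uniform distribution. Months differ in length, so equal probabilities are only an approximation of "no preference".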

  • created_at_day
#distribution
qplot(created_at_day, data = L3Q1_data,geom = "bar") 

#summary statistics
summary(L3Q1_data$created_at_day)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   16.00   15.78   23.00   31.00

created_at_day appears to be roughly uniformly distributed. Users do not seem to prefer any particular day of the month for creating an account.

  • created_at_year
#distribution
qplot(created_at_year, data = L3Q1_data,geom = "bar") 

#summary statistics
summary(L3Q1_data$created_at_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2006    2009    2011    2011    2013    2015

It is hard to say which distribution created_at_year follows, but the plot shows that the number of new users increases over the early years.

Column5 country

#distribution
qplot(country, data = L3Q1_data,geom = "bar") 

#summary statistics
summary(L3Q1_data$country)
##              Germany            Argentina            Australia 
##                   54                  190                  291 
##              Belgium               Brazil               Canada 
##                   56                  121                  943 
##                Chile                China             Colombia 
##                   61                   57                   50 
##              Denmark                Earth              England 
##                   59                  516                  467 
##       European Union              Finland               France 
##                   58                  110                  180 
##              Germany               Greece            Hong Kong 
##                  151                   59                   60 
##                India              Ireland               Israel 
##                  890                  171                   52 
##                Italy                Japan                Kenya 
##                  116                  162                  117 
##               Kuwait           Luxembourg             Malaysia 
##                   49                   62                   55 
##               Mexico          Netherlands              Nigeria 
##                  236                  170                  132 
##               Panama          Philippines             Portugal 
##                   59                   53                   49 
##               Russia             Scotland            Singapore 
##                   63                   57                   53 
##         South Africa                Spain               Sweden 
##                  183                  283                  123 
##          Switzerland               Turkey United Arab Emirates 
##                  115                   73                   56 
##       United Kingdom                  USA 
##                  149                14905

The plot shows that most Twitter users in this data set come from the USA (14905 of 21916), far more than from any other country.
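
The str() output lists a level " Germany" with a leading space, and the summary above shows Germany twice (54 and 151 users), which suggests that the same country is stored under two labels. Below is a minimal sketch of one way to merge such labels; `country_raw` is a small hypothetical sample, not the real column.

```r
# Hypothetical sample of raw country labels, one variant with a leading space.
country_raw <- factor(c(" Germany", "Germany", "USA", "USA", " Germany"))

# trimws() works on character vectors, so convert, trim, and re-factor.
country_clean <- factor(trimws(as.character(country_raw)))

nlevels(country_raw)     # levels before trimming
nlevels(country_clean)   # levels after trimming
table(country_clean)
```

The same idea, combined with `tolower()`, might help with location, where entries such as "new york" and "New York USA" appear as separate levels.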

Column6 location

#distribution
qplot(location, data = L3Q1_data,geom = "bar") 

#summary statistics
summary(L3Q1_data$location)
##                 Mexico                 Boston               Montreal 
##                    122                    108                    107 
##                 Nevada              Bangalore   Indianapolis Indiana 
##                     80                     79                     76 
##             Pune India           Dallas Texas          New Hampshire 
##                     75                     74                     74 
##           Cambridge MA               Istanbul           Vancouver BC 
##                     73                     73                     73 
##      Brooklyn New York             Fremont CA  London United Kingdom 
##                     72                     72                     72 
##              Minnesota                     NY             Raleigh NC 
##                     72                     72                     72 
##                Arizona             california          Houston Texas 
##                     71                     71                     71 
##                Nigeria                 Philly            San Jose CA 
##                     71                     71                     71 
##    San Jose California        The Netherlands Buenos Aires Argentina 
##                     71                     71                     69 
##          Miami Florida       Orange County CA            Richmond VA 
##                     69                     69                     69 
##            Tokyo Japan              Cleveland           Maryland USA 
##                     69                     68                     68 
##           New York USA                     PA           South Africa 
##                     68                     68                     68 
##          Alexandria VA                     CT                 Espana 
##                     67                     67                     67 
##             Houston TX            Kansas City               Nebraska 
##                     67                     67                     67 
##             Phoenix AZ           San Diego CA            Stamford CT 
##                     67                     67                     67 
##                 Sweden              The World             Toronto ON 
##                     67                     67                     67 
##             Buffalo NY                Chennai             East Coast 
##                     66                     66                     66 
##              hyderabad                Indiana      Madison Wisconsin 
##                     66                     66                     66 
##          Massachusetts            Mississippi               new york 
##                     66                     66                     66 
##            Pasadena CA           Rhode Island       Santa Barbara CA 
##                     66                     66                     66 
##                     SF       Sydney Australia Toronto Ontario Canada 
##                     66                     66                     66 
##                    usa            Columbus OH                     MA 
##                     66                     65                     65 
##    Melbourne Australia                     NJ        Orlando Florida 
##                     65                     65                     65 
##          Ottawa Canada           Rochester NY   San Diego California 
##                     65                     65                     65 
##              Somewhere               Virginia                  Earth 
##                     65                     65                     64 
##           Indianapolis           Johannesburg       Oklahoma City OK 
##                     64                     64                     64 
##        Portland Oregon                Seattle               Stockton 
##                     64                     64                     64 
##                     TX                    ATL           Cleveland OH 
##                     64                     63                     63 
##          Fort Worth TX                Ireland         Kansas City MO 
##                     63                     63                     63 
##            Mexico City                 Moscow             Orlando FL 
##                     63                     63                     63 
##          san francisco         Silicon Valley                Atlanta 
##                     63                     63                     62 
##             Atlanta GA        Bangalore India          Cincinnati OH 
##                     62                     62                     62 
##          Columbus Ohio        Denver Colorado                Finland 
##                     62                     62                     62 
##                (Other) 
##                  15131

The location counts are roughly uniform, but three locations (Mexico, Boston, and Montreal) have noticeably higher counts than the others and look like outliers.

Column7,8,9,10,11 friends_count, followers_count, statuses_count, favourites_count, favourited_count

  • friends_count
#summary statistics
summary(L3Q1_data$friends_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     -84     123     324    1058     849  660500
#distribution
qplot(friends_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, friends_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from -84 to 660500, with a mean of 1058 and a median of 324. The minimum is negative, which is impossible for a count, so it is an invalid value rather than an ordinary outlier.

The boxplot also shows at least 9 outliers above the top whisker.
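
Counting outliers from the plot is imprecise. Below is a minimal sketch of how the points beyond the whiskers could be counted directly with base R's `boxplot.stats()`. The `friends` vector is simulated stand-in data; only the two extreme values (-84 and 660500) are taken from the summary above.

```r
# Simulated stand-in for friends_count (hypothetical data), plus the
# minimum and maximum reported in the summary above.
set.seed(7)
friends <- c(round(rexp(1000, rate = 1 / 500)), -84, 660500)

# boxplot.stats() returns the points lying beyond the whiskers
# (by default more than 1.5 * IQR outside the quartile hinges).
out_values <- boxplot.stats(friends)$out

length(out_values)   # number of points a boxplot would draw as outliers
sum(friends < 0)     # impossible negative counts
```

Because the rule is based on the interquartile range, a heavy-tailed count column can have many more flagged points than are visible as separate dots in the plot.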

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 2e+03.

#distribution
qplot(friends_count, data = L3Q1_data, xlim = c(0,2*10^3) )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1732 rows containing non-finite values (stat_bin).

qplot(friends_count, data = L3Q1_data, xlim = c(0,2*10^3), geom = "density")
## Warning: Removed 1732 rows containing non-finite values (stat_density).

Most users have around 100 friends.

  • followers_count
#summary statistics
summary(L3Q1_data$followers_count)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0      105      336     5859     1075 22190000
#distribution
qplot(followers_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, followers_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 0 to 22190000, with a mean of 5859 and a median of 336. The mean is far larger than the median, which implies that the data are right-skewed by some very large outliers.

The boxplot also shows at least 6 outliers above the top whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 5.0e+03 for the histogram and from 0 to 2e+03 for the density plot.

#distribution
qplot(followers_count, data = L3Q1_data, xlim = c(0,5*10^3) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1870 rows containing non-finite values (stat_bin).

qplot(followers_count, data = L3Q1_data, xlim = c(0,2*10^3), geom = "density")
## Warning: Removed 3563 rows containing non-finite values (stat_density).

Most users have around 100 followers.

  • statuses_count
#summary statistics
summary(L3Q1_data$statuses_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     558    2341   12490    9348 1136000
#distribution
qplot(statuses_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, statuses_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 1 to 1136000, with a mean of 12490 and a median of 2341. The mean is far larger than the median, which implies that the data are right-skewed by some very large outliers.

The boxplot also shows at least 15 outliers above the top whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 25000.

#distribution
qplot(statuses_count, data = L3Q1_data, xlim = c(-1,25000) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2490 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

qplot(statuses_count, data = L3Q1_data, xlim = c(-1,25000), geom = "density")
## Warning: Removed 2490 rows containing non-finite values (stat_density).

Most users have around 1000 statuses.

  • favourites_count
#summary statistics
summary(L3Q1_data$favourites_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      16     164    2217     950 1140000
#distribution
qplot(favourites_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, favourites_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 0 to 1140000, with a mean of 2217 and a median of 164. The mean is far larger than the median, which implies that the data are right-skewed by some very large outliers.

The boxplot also shows at least 10 outliers above the top whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 10000.

#distribution
qplot(favourites_count, data = L3Q1_data, xlim = c(0,10^4) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 934 rows containing non-finite values (stat_bin).

qplot(favourites_count, data = L3Q1_data, xlim = c(0,10^4), geom = "density")
## Warning: Removed 934 rows containing non-finite values (stat_density).

Most users have around 500 favourites.

  • favourited_count
#summary statistics
summary(L3Q1_data$favourited_count)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      2.00      9.00     92.24     36.00 105000.00
#distribution
qplot(favourited_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, favourited_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 0 to 105000, with a mean of 92.24 and a median of 9. The mean is far larger than the median, which implies that the data are right-skewed by some very large outliers.

The boxplot also shows at least 10 outliers above the top whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 1000.

#distribution
qplot(favourited_count, data = L3Q1_data, xlim = c(0,1000) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 258 rows containing non-finite values (stat_bin).

qplot(favourited_count, data = L3Q1_data, xlim = c(0,1000), geom = "density")
## Warning: Removed 258 rows containing non-finite values (stat_density).

Most users have been favourited around 30 times.

However, even the zoomed-in histograms of the count columns do not show the distribution of these data well. The data need some transformation.
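
One common transformation for heavy-tailed counts is a logarithm. Below is a minimal sketch, assuming `log10(x + 1)` is an acceptable choice (the `+ 1` keeps zero counts finite). The `counts` vector is simulated stand-in data, and `skew()` is a hand-written helper so that the sketch needs no extra package.

```r
# Simulated stand-in for a heavy-tailed count column such as followers_count
# (hypothetical data).
set.seed(1)
counts <- round(rlnorm(1000, meanlog = 6, sdlog = 2))

# log10(x + 1) compresses the long right tail and keeps zeros finite.
log_counts <- log10(counts + 1)

# Sample skewness, written out by hand (hypothetical helper).
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(counts)       # strongly right-skewed on the raw scale
skew(log_counts)   # much closer to 0 on the log scale

# Histogram on the log scale, with no zooming needed.
hist(log_counts, breaks = 30, main = "log10(count + 1)", xlab = "log10(count + 1)")
```

With the real data the same transform could be applied inside the plot call, for example to `L3Q1_data$followers_count`. Negative values such as the -84 in friends_count would have to be handled first, because the logarithm is undefined for them.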

Column12,13,14 dob_day, dob_year, dob_month

  • dob_day
#distribution
qplot(dob_day, data = L3Q1_data,geom = "bar") 

#summary statistics
summary(L3Q1_data$dob_day)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00   13.00   13.49   21.00   35.00

dob_day appears to be roughly uniformly distributed, but more people report the first day of the month as their birthday than any other day. The maximum of 35 is not a valid day of the month.

  • dob_year
#distribution
qplot(dob_year, data = L3Q1_data,geom = "bar") 

#summary statistics
summary(L3Q1_data$dob_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1900    1965    1982    1976    1990    2000

The values of dob_year range from 1900 to 2000. The year 1900 is unexpected and may be an outlier. The most common birth years are 1987 and 1989.

  • dob_month
#distribution
qplot(dob_month, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#summary statistics
summary(L3Q1_data$dob_month)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    3.000    6.000    6.398    9.000 1992.000

The values of dob_month range from 1 to 1992. The maximum is clearly an invalid value, because a month must lie between 1 and 12.
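
Rather than hiding invalid months with xlim, they can be flagged explicitly. Below is a minimal sketch; the `dob_month` vector here is a small hypothetical sample, not the real column, and replacing invalid entries with NA is one possible choice among several.

```r
# Hypothetical sample of month values, including one invalid entry.
dob_month <- c(1, 6, 12, 1992, 3, 9)

# A valid month lies between 1 and 12.
valid_month <- dob_month >= 1 & dob_month <= 12

# Replace invalid entries with NA instead of dropping the whole row.
dob_month_clean <- ifelse(valid_month, dob_month, NA)

sum(!valid_month)          # number of invalid entries
summary(dob_month_clean)   # the summary now reports the NA count separately
```

The same check with an upper bound of 31 would catch the dob_day maximum of 35.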

#distribution
qplot(dob_month, data = L3Q1_data, xlim = c(0,12)) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).

The distribution of dob_month is roughly uniform. However, more people are born in January than in any other month.

Column15 gender

#distribution
qplot(gender, data = L3Q1_data) 

#summary statistics
summary(L3Q1_data$gender)
## female   male   NA's 
##   7319  14569     28

There are about twice as many male users (14569) as female users (7319) in this data set, and 28 values are missing.
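
Proportions make this comparison easier to read than raw counts. Below is a minimal sketch with a small hypothetical `gender` vector; `useNA = "ifany"` keeps missing values visible in the table.

```r
# Hypothetical sample of gender values, including one missing value.
gender <- factor(c("female", "male", "male", NA, "male", "female"))

# Keep NA as its own category so missing values are not silently dropped.
gender_counts <- table(gender, useNA = "ifany")

round(prop.table(gender_counts), 3)
```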

Column16,17 mobile_favourites_count, mobile_favourited_count

  • mobile_favourites_count
#summary statistics
summary(L3Q1_data$mobile_favourites_count)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    152.9      0.0 377100.0
#distribution
qplot(mobile_favourites_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, mobile_favourites_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 0 to 377100, with a mean of 152.9 and a median of 0.0. The mean is far larger than the median, which implies that the data are right-skewed by some very large outliers.

The boxplot also shows at least 6 outliers above the top whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 1000.

#distribution
qplot(mobile_favourites_count, data = L3Q1_data, xlim = c(0,1000) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 359 rows containing non-finite values (stat_bin).

qplot(mobile_favourites_count, data = L3Q1_data, xlim = c(0,1000), geom = "density")
## Warning: Removed 359 rows containing non-finite values (stat_density).

Most users have a mobile_favourites_count of 0.

However, even the zoomed-in histogram does not show the distribution of these data well. The data need some transformation.

  • mobile_favourited_count
#summary statistics
summary(L3Q1_data$mobile_favourited_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0     649       0 5032000
#distribution
qplot(mobile_favourited_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, mobile_favourited_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 0 to 5032000, with a mean of 649 and a median of 0. The mean is far larger than the median, which implies that the data are right-skewed by some very large outliers.

The boxplot also shows at least 6 outliers above the top whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 1000.

#distribution
qplot(mobile_favourited_count, data = L3Q1_data, xlim = c(0,1000) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 489 rows containing non-finite values (stat_bin).

qplot(mobile_favourited_count, data = L3Q1_data, xlim = c(0,1000), geom = "density")
## Warning: Removed 489 rows containing non-finite values (stat_density).

Most users have a mobile_favourited_count of 0.

However, even the zoomed-in histogram does not show the distribution of these data well. The data need some transformation.

Column18 education

#summary statistics
summary(L3Q1_data$education)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    11.0    13.0    12.5    14.0    24.0
#distribution
qplot(education, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, education, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 3 to 24, with a mean of 12.5 and a median of 13. The histogram looks like a Poisson distribution, and the most common education levels are close to the mean of 12.5 years.

The boxplot shows that there are some outliers outside the whiskers.

Column19 experience

#summary statistics
summary(L3Q1_data$experience)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -32.00    0.00    7.00   10.88   20.00   74.00
#distribution
qplot(experience, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, experience, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from -32 to 74, with a mean of 10.88 and a median of 7. The histogram looks roughly like a normal distribution, except that more people report 0 years of experience than any other value. One possible reason is that users did not enter their years of experience and 0 was set by default.

However, some values are less than 0, which is impossible for years of experience. The boxplot also shows some outliers above the top whisker.

Column20 age

#summary statistics
summary(L3Q1_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.00   28.00   36.00   35.54   44.00   91.00
#distribution
qplot(age, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, age, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from -6 to 91, with a mean of 35.54 and a median of 36. The histogram looks like a normal distribution, and ages around 30 are the most common. Ages of 0 also occur (for example in the second row of the str() output), possibly because users did not enter an age and 0 was set by default.

However, some values are less than 0, which is impossible for an age. The boxplot also shows some outliers above the top whisker and below the bottom whisker.

Column21 race

#summary statistics
summary(L3Q1_data$race)
##             arab            asian         hispanic           indian 
##              187              960              353              162 
##           latino            mixed  native american pacific islander 
##             1115              199              256              276 
##          persian            white 
##              376            18032
#distribution
qplot(race, data = L3Q1_data) 

The bar chart shows that white users (18032 of 21916) far outnumber users of every other race in this data set.

Column22 wage

#summary statistics
summary(L3Q1_data$wage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   13.52   20.36   22.97   28.40  105.00
#distribution
qplot(wage, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, wage, data = L3Q1_data, geom = "boxplot")

#normality
qqnorm(L3Q1_data$wage)
qqline(L3Q1_data$wage)

The summary statistics show that the data range from 5 to 105, with a mean of 22.97 and a median of 20.36. The histogram looks roughly like a normal distribution with a longer right tail (the mean exceeds the median), and wages around 25 are more common than other wage levels.

The minimum wage is 5.00, so there are no negative values. The boxplot shows some outliers above the top whisker.

Column23,24 retweeted_count, retweet_count

  • retweeted_count
#summary statistics
summary(L3Q1_data$retweeted_count)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   1.0000   0.9715   1.0000 705.0000
#distribution
qplot(retweeted_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, retweeted_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 0 to 705, with a mean of 0.9715 and a median of 1.

The boxplot also shows at least 5 outliers above the top whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 100.

#distribution
qplot(retweeted_count, data = L3Q1_data, xlim = c(0,100) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite values (stat_bin).

qplot(retweeted_count, data = L3Q1_data, xlim = c(0,100), geom = "density")
## Warning: Removed 8 rows containing non-finite values (stat_density).

Most users have a retweeted_count close to 0.

However, even the zoomed-in histogram does not show the distribution of these data well. The data need some transformation.

  • retweet_count
#summary statistics
summary(L3Q1_data$retweet_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    3.00   52.73   19.00 5506.00
#distribution
qplot(retweet_count, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, retweet_count, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 0 to 5506, with a mean of 52.73 and a median of 3. The mean is far larger than the median, which implies that the data are right-skewed by some very large outliers.

The boxplot also shows at least 3 outliers above the top whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 0 to 1000.

#distribution
qplot(retweet_count, data = L3Q1_data, xlim = c(0,1000) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 290 rows containing non-finite values (stat_bin).

qplot(retweet_count, data = L3Q1_data, xlim = c(0,1000), geom = "density")
## Warning: Removed 290 rows containing non-finite values (stat_density).

Most users have a retweet_count close to 0.

However, even the zoomed-in histogram does not show the distribution of these data well. The data need some transformation.

Column25 height

#summary statistics
summary(L3Q1_data$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   165.0   172.0   171.5   178.0   203.0
#distribution
qplot(height, data = L3Q1_data) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Outliers
qplot(1, height, data = L3Q1_data, geom = "boxplot")

The summary statistics show that the data range from 1 to 203, with a mean of 171.5 and a median of 172. A minimum height of 1 is not plausible and points to a data quality problem.

The boxplot also shows at least 3 outliers above the top whisker and below the bottom whisker.

The histogram cannot show the shape of the distribution because of the outliers, so I zoom in on the range from 150 to 200.

#distribution
qplot(height, data = L3Q1_data, xlim = c(150,200) ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(height, data = L3Q1_data, xlim = c(150,200), geom = "density")

  • identify useful raw data & transforms (e.g. log(x))
  • identify data quality problems
  • identify outliers
  • identify subsets of interest
  • suggest functional relationships
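
For the last item in the list above, one way to look for a functional relationship between two count columns is to compare them on a log scale. Below is a minimal sketch with simulated data; `followers` and `friends` are hypothetical stand-ins for the real columns, and the power-law link between them is built into the simulation purely for illustration.

```r
set.seed(3)
n <- 500

# Hypothetical stand-ins for followers_count and friends_count.
followers <- round(rlnorm(n, meanlog = 6, sdlog = 1.5))
friends   <- round(followers^0.7 * rlnorm(n, meanlog = 0, sdlog = 0.5))

# log10(x + 1) keeps zero counts finite.
log_followers <- log10(followers + 1)
log_friends   <- log10(friends + 1)

# Spearman correlation is rank-based, so the monotone log transform
# does not change it.
cor(followers, friends, method = "spearman")

# Pearson correlation on the log scale measures linear association there.
cor(log_followers, log_friends)

# Linear fit on the log scale: the slope approximates a power-law exponent.
fit <- lm(log_friends ~ log_followers)
coef(fit)
```

On the real data, invalid values (such as negative counts) would need to be removed before taking logarithms.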